
1 Preface

Learning to use code to manipulate large datasets and conduct data analysis is becoming a critical skill for many professionals in the environmental sciences. This practical tutorial will teach you the basics of using R to analyse a long-term dataset of surface water quality and to interpret the results, linking real data to course theory.

1.1 Learning Outcomes

You are expected to use your own computer for all exercises. By the end of this practical, you will be able to:

  • Understand the R and RStudio interface
  • Understand the fundamentals of R code
  • Perform basic data analysis using a long-term dataset of surface water quality
  • Use R to create plots and summarise the data
  • Interpret water quality metrics

1.2 How to use this material

Here I present a tutorial that will guide you through the use of different R functions. I will walk you through the tutorial during the practical. You can use this tutorial as a starting point for using R for data analysis in the future.

The tutorial is organized in 5 main sections.

  • Section 1 is a preface. It shows you the basics of R and R Studio.
  • Section 2 is a basic introduction to R code.
  • Section 3 is an introduction to the use of the R package dplyr to manage large datasets. For this, we will use a water quality dataset from China.
  • Section 4 is a quick introduction to the R package ggplot2 to create nice plots.
  • Section 5 is an assignment. You will conduct an analysis on a similar water quality dataset for Ireland.

1.3 What is R and why use R?

R is a very powerful statistical programming language that is used broadly by researchers around the world. R is an attractive programming language because it is free, open source, and platform independent. With all the libraries that are available (and those that are in rapid development), it is quickly becoming a one-stop shop for all your analysis needs. Most academic statisticians now use R, which has allowed for greater sharing of R code or packages to implement their recommended methods. One of the very first things academics often ask when hiring someone is simply, “Can you describe your R or statistical programming experience?” It is now critical to have this background to be competitive for scientific (and many other) positions.

Reasons to use R include:

  1. It’s free – open source! If you are a teacher or a student, the benefits are obvious.
  2. It runs on a variety of platforms including Windows, Unix and MacOS.
  3. It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner.
  4. It contains advanced statistical routines not yet available in other software.
  5. New add-on “packages” are being created and updated constantly.
  6. It has state-of-the-art graphics capabilities.

R does have a steep learning curve that can often be intimidating to new users, particularly those without prior coding experience. While this can be very frustrating in the initial stages, learning R is like learning a language where proficiency requires practice and continual use of the program.

Our advice is to push yourself to use this tool in everything you do. At first, R will not be the easiest or quickest option. With persistence, however, you will see the benefit of R and continue to find new ways to use it in your work.

1.4 R or RStudio? How to get them?

R is available for Linux, MacOS X, and Windows (95 or later) platforms. Software can be downloaded from one of the Comprehensive R Archive Network (CRAN) mirror sites. Once installed, R will open a console where you run code. You can also work on a script file, where you can write and save your work, and other windows that will show up on demand such as the plot tab (Fig. 1).

Fig. 1. R console, script and plot tabs.

RStudio is a professional software tool that integrates with R. It offers features beyond the standard R interface, and many users find it easier to use than R alone (Fig. 2). Once you have installed R, you should also download and install RStudio. For this course, we will work exclusively in RStudio.

Fig. 2. RStudio software.

Make sure you have installed R and RStudio.

The last thing you need is the data we will use. To keep everything organised, first create a folder on your computer for this project. Name it BL3009.

Now, inside this folder, create a new folder called Data.

Download the data files from this drive and paste them inside the Data folder. There are two files, China_dataset.csv and Ireland_dataset.csv.

You are now ready for class :)

2 An intro to the R programming language

2.1 Basic R concepts

There are a few concepts that are important to keep in mind before you start coding. The fact that R is a programming language may deter some users who think “I can’t program”. This should not be the case for two reasons. First, R is an interpreted language, not a compiled one, meaning that commands typed on the keyboard are executed directly, without requiring you to build a complete program as in many other computer languages (C, Pascal, ...). Second, R’s syntax is very simple and intuitive. For instance, a linear regression can be done with the command lm(y ~ x), which means fitting a linear model with y as the response and x as a predictor.

In R, in order to be executed, a function always needs to be written with parentheses, even if there is nothing within them (e.g., ls()). If you type the name of a function without parentheses, R will display the content of the function.

When R is running, variables, data, functions, results, etc. are stored in the active memory of the computer in the form of objects to which you assign a name. You can act on these objects with operators (arithmetic, logical, comparison, ...) and functions (which are themselves objects).

The name of an object must start with a letter (A-Z or a-z) and can be followed by letters, digits (0-9), dots (.), and underscores (_).

When referring to the directory of a folder or a data file, R uses forward slash “/”. You need to pay close attention to the direction of the slash if you copy a file path or directory from a Windows machine.

It is also important to know that R discriminates between uppercase and lowercase letters in the names of objects, so that x and X can name two distinct objects (even under Windows).
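
To make the naming rules concrete, here is a minimal sketch. The object names (x, X, water_temp.2001) are made up for illustration only.

```r
# Object names are case sensitive: x and X are two separate objects
x <- 10
X <- "ten"
x        # the numeric value 10
X        # the character string "ten"

# A valid name starts with a letter and may contain digits, dots and underscores
water_temp.2001 <- 17.0

# make.names() shows how R would repair an invalid name (here, one starting with a digit)
make.names("2001_temp")   # "X2001_temp"
```

If you ever get an "object not found" error, checking the capitalisation of the name is a good first step.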


2.2 Starting with R

2.2.1 Setting your working directory

Like in many other programs, you should start your session by defining your working directory - the folder where you will work. This will be the location on your computer where any files you save will be located. To determine your current working directory, type:

getwd()

Use setwd() to change or set a new working directory. For instance, you can set your working directory to be in your Documents folder on the C: drive, or in any folder you prefer.

setwd("C:/Documents/R_Practice")
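
If you want to try setwd() without touching your own folders, the sketch below uses R's temporary directory. The folder name R_Practice and the object names project_dir and old_dir are only examples; on your computer you would point to your own project folder instead.

```r
# file.path() joins folder names with forward slashes on every operating system
project_dir <- file.path(tempdir(), "R_Practice")   # example folder inside R's temporary directory

# Create the folder only if it does not exist yet
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

old_dir <- setwd(project_dir)   # setwd() returns the previous working directory
getwd()                         # now points at the example folder
setwd(old_dir)                  # go back to where we started
```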

2.3 R Fundamentals

2.3.1 Data Types

There are three fundamental data types in R that you will work with in this practical:

  1. Character
  2. Numeric
  3. Integer

You can check the data type of an object using the function class(). To convert between data types you can use: as.integer(), as.numeric(), as.logical(), as.character().

For instance:

city <- 'Beijing'
class(city)
## [1] "character"
number <- 3.4
class(number)
## [1] "numeric"
Integer <- as.integer(number)
Integer
## [1] 3
class(Integer)
## [1] "integer"
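
A few more conversions are sketched below with made-up values. They show two details that often surprise new users: as.integer() drops the decimal part instead of rounding, and text that is not a number becomes NA.

```r
# as.integer() drops the decimal part rather than rounding
as.integer(3.9)          # 3
round(3.9)               # 4

# Text that looks like a number can be converted to numeric
as.numeric("7.3")        # 7.3

# Text that is not a number becomes NA, and R prints a warning
as.numeric("seven")

# is.numeric() and is.character() test the type without converting
is.numeric(7.3)          # TRUE
is.character("Beijing")  # TRUE
```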

2.3.2 Assigning data to objects

Since R is a programming language, we can store information as objects to avoid unnecessary repetition. Note again that object names are case sensitive; ‘x’ is not the same as ‘X’!

city <- "Cork"
summary(city)

number <- 2
summary(number)

character <- as.character(2)
character

Data are very often stored in different folders to maintain an organizational pattern in your projects. In those cases, it is not necessary to re-set the working directory every time we want to import files to R that are stored in different folders, as long as these folders are within the root directory you have previously set. For instance, let’s say you have a table stored in a folder called data, which is a subfolder within your root working directory (C:/Documents/R_Practice). You can point to the data folder when reading the table as in the example below:

table <- read.csv(file="./data/TheDataIWantToReadIn.csv", header=TRUE) # read a csv table stored in the data folder

Note that because data is a subfolder in your root directory, you do not need to provide the complete directory information when reading the table “./data/TheDataIWantToReadIn.csv”. You can always provide the full directory of a data file stored on your local drive to avoid confusion.
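
The relative path idea can be tried safely with the sketch below. It builds a throwaway project folder inside R's temporary directory, so the folder path_demo, the file toy.csv and the values in it are all made up for illustration.

```r
# Work inside a temporary folder so nothing on your computer is overwritten
root <- file.path(tempdir(), "path_demo")            # example project folder
dir.create(file.path(root, "data"), recursive = TRUE, showWarnings = FALSE)
old_dir <- setwd(root)

# Write a tiny made-up table into the data subfolder
toy <- data.frame(site = c("A", "B"), temperature = c(17.0, 18.6))
write.csv(toy, file = "./data/toy.csv", row.names = FALSE)

# Read it back using a path relative to the working directory
toy_in <- read.csv(file = "./data/toy.csv", header = TRUE)
toy_in

file.exists("./data/toy.csv")   # TRUE while the working directory is root
setwd(old_dir)                  # restore the previous working directory
```

The same relative path stops working as soon as the working directory changes, which is why setting it at the start of a session matters.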

2.3.3 Special characters

The # character is used to add comments to your code. # indicates the beginning of a comment and everything after # on a line will be ignored and not run as code. Adding comments to your code is considered good practice because it allows you to describe in plain language (for yourself or others) what your code is doing.

#This is a comment

2.4 R Data Structure

2.4.1 Vectors

Vectors are a basic data structure in R. They contain a sequence of data of a single type: characters, numbers, or logical (TRUE/FALSE) values. Remember: If you are unsure or need help, use the help function (e.g., help(seq) or ?seq). Below are several ways to create vectors in R.

1:20
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
c(1,2,3,4,5)
## [1] 1 2 3 4 5
seq(0,100,by=10)
##  [1]   0  10  20  30  40  50  60  70  80  90 100
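
Once a vector exists, you can pull out elements and do arithmetic on it. Below is a minimal sketch with a made-up vector called temps.

```r
temps <- c(17.0, 18.6, 19.5, 23.6, 27.2)   # made-up temperatures in degrees Celsius

length(temps)        # number of elements: 5
temps[2]             # second element: 18.6
temps[c(1, 5)]       # first and fifth elements
temps > 20           # logical vector: FALSE FALSE FALSE TRUE TRUE
temps[temps > 20]    # keep only values above 20: 23.6 27.2

# Arithmetic is applied element by element
temps * 9 / 5 + 32   # convert to degrees Fahrenheit

# A vector holds one type; mixing numbers and text turns everything into text
class(c(1, "a"))     # "character"
```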

2.4.2 Matrices and Dataframes

Matrices and dataframes are common ways to store tabular data. Understanding how to manipulate them is important to be able to conduct more complex analyses. Both matrices and dataframes are composed of rows and columns. The main difference between matrices and dataframes is that dataframes can contain many different classes of data (numeric, character, etc.), while matrices can only contain a single class.

Create a matrix with 4 rows and 5 columns from a vector x holding the numbers 1 to 20. Consult the help (e.g., help(matrix) or ?matrix) to determine the syntax required.

x <- 1:20
test_matrix <- matrix(data = x, nrow = 4, ncol = 5)
test_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20
# Note, I can assign any name to an object that I create.  Generally it is best to name things in a way that is meaningful.

2.4.2.1 Subset of Matrices and Dataframes

Now, if we wanted to reference any value in the matrix, we could do so with matrix notation. The first value in matrix notation references the row and the second value references the column. COMMIT THIS TO MEMORY! I remember this by thinking Roman Catholic. So, if you wanted to view only the value in the 1st row, 5th column, you’d type:

# test_matrix[row, column]
test_matrix[1,5]
## [1] 17

In addition to using positive integers to indicate the exact location of the subset of data we want to extract, you can also use other notation to indicate subsets of data that you want to include or exclude. You can use: negative integers (to exclude data at a specific location), zero (to create empty objects with consistent format), blank spaces (to select the entire row/column), logical values (to select the data associated with TRUE values), or names (to select specific columns or rows by their names). Try to understand how each type of notation works!
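
Each of these notations is sketched below on a small matrix m with the same layout as test_matrix. The column names a to e are added only for this illustration.

```r
m <- matrix(1:20, nrow = 4, ncol = 5)       # same layout as test_matrix
colnames(m) <- c("a", "b", "c", "d", "e")   # example column names

m[-1, ]                            # negative integer: every row except the first
m[, c("a", "e")]                   # names: only columns a and e
m[c(TRUE, FALSE, TRUE, FALSE), ]   # logical values: rows 1 and 3
m[m[, "a"] > 2, ]                  # logical test: rows where column a is greater than 2
m[0, ]                             # zero: an empty matrix that keeps the 5 columns
m[2, ]                             # blank column index: the whole second row
```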

For example, what if you wanted to view all the values in the 5th column? This literally says, extract all rows but only the 5th column from the object called test_matrix.

test_matrix[,5]
## [1] 17 18 19 20

What about the 4th row?

test_matrix[4,]
## [1]  4  8 12 16 20

What happens to the matrix if we append a character field? Use the cbind() (column bind) command to bind a new column, called ‘countries’. Note that I am not changing the contents of test_matrix. Can you figure out how to do a row bind (hint: use rbind())?

countries <- c("United States", "Pakistan", "Ireland", "China")
cbind(test_matrix,countries)
##                             countries      
## [1,] "1" "5" "9"  "13" "17" "United States"
## [2,] "2" "6" "10" "14" "18" "Pakistan"     
## [3,] "3" "7" "11" "15" "19" "Ireland"      
## [4,] "4" "8" "12" "16" "20" "China"
#Note that I am not changing/overwriting the contents of test_matrix.  I could, but I'd have to change my code to
#test_matrix <- cbind(test_matrix,countries)

Why is everything inside the table now enclosed in quotes? Recall what we said about matrices only containing one data type. What happens if I coerce this to a dataframe?

test_dataframe <- data.frame(test_matrix,countries)
test_dataframe
##   X1 X2 X3 X4 X5     countries
## 1  1  5  9 13 17 United States
## 2  2  6 10 14 18      Pakistan
## 3  3  7 11 15 19       Ireland
## 4  4  8 12 16 20         China
# Have I changed the file type?
class(test_dataframe)
## [1] "data.frame"

Can I rename the column headings?

names(test_dataframe) <- c("Val1", "Val2", "Val3", "Val4", "Val5", "Countries")
test_dataframe
##   Val1 Val2 Val3 Val4 Val5     Countries
## 1    1    5    9   13   17 United States
## 2    2    6   10   14   18      Pakistan
## 3    3    7   11   15   19       Ireland
## 4    4    8   12   16   20         China

Can I use the same matrix notation to reference a particular row and column? Are there other ways to reference a value?

test_dataframe[3,5]
## [1] 19
test_dataframe[,5]
## [1] 17 18 19 20
test_dataframe$Val5[3]
## [1] 19
test_dataframe$Val5
## [1] 17 18 19 20
test_dataframe[,"Val5"]
## [1] 17 18 19 20

You can also use some very simple commands to determine the size of dataframes or matrices.

nrow(test_dataframe)
## [1] 4
ncol(test_dataframe)
## [1] 6
dim(test_dataframe)
## [1] 4 6

2.5 Functions

An R function is a set of statements organised together to carry out a specific task. A function takes input arguments, some of which may be optional, and returns a value. Custom functions can be easily constructed in R. Most often, however, we will use built-in functions from base packages or other downloadable packages.

Many arguments are optional because they are given default values (in the function’s help document, under the ‘Usage’ section, the optional arguments are shown with a default value following the “=” symbol). When you don’t specify the optional arguments, they take their default values. Functions are normally called using the following format: function_name(input_data, argument1, argument2, ...)
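
As a small illustration of a custom function with an optional argument, here is a sketch. The function name celsius_to_fahrenheit and its digits argument are invented for this example.

```r
# celsius_to_fahrenheit() is a made-up name; digits is an optional argument with a default of 1
celsius_to_fahrenheit <- function(celsius, digits = 1) {
  fahrenheit <- celsius * 9 / 5 + 32
  round(fahrenheit, digits)   # the last evaluated expression is the return value
}

celsius_to_fahrenheit(23.64)               # uses the default digits = 1
celsius_to_fahrenheit(23.64, digits = 0)   # overrides the default
celsius_to_fahrenheit(c(0, 100))           # works on whole vectors too
```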

print(2+2)
## [1] 4
x <- matrix(1:10, 5, 2)
x
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
y <- matrix(1:5)
y
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [5,]    5
df.example <- cbind(x, y)
df.example
##      [,1] [,2] [,3]
## [1,]    1    6    1
## [2,]    2    7    2
## [3,]    3    8    3
## [4,]    4    9    4
## [5,]    5   10    5

Typing ?function_name loads the help file for that function. Also note that any functions in non-base packages require installing and loading that package first.

Here, for example, we install and load a package named “ggplot2” that we will use for data visualization.

install.packages("ggplot2")
library(ggplot2)

2.5.1 Pre-existing Functions

R also contains many pre-existing functions in the base software. Numeric functions include sum(), mean(), sd(), min(), max(), median(), range(), quantile(), or summary(). Try a few of these on the numeric vectors you have created.

sum(x)
## [1] 55
summary(x)
##        V1          V2    
##  Min.   :1   Min.   : 6  
##  1st Qu.:2   1st Qu.: 7  
##  Median :3   Median : 8  
##  Mean   :3   Mean   : 8  
##  3rd Qu.:4   3rd Qu.: 9  
##  Max.   :5   Max.   :10
range(y)
## [1] 1 5

2.5.2 Calculations & Arithmetic Operators

R can be used to perform basic calculations and report the results back to the user.

4+2
## [1] 6
6*8
## [1] 48
(842-62)/3
## [1] 260

Exponentiation: ^

2^3
## [1] 8

Max and Min: max(), min()

vector_numbers <- c(2, 3, 4, 10)
max(vector_numbers) 
## [1] 10
min(vector_numbers)
## [1] 2

Can you calculate the square root of each element in vector_numbers and then subtract 5?

2.6 Getting Help

One of the most useful commands in R is ?. At the command prompt (signified by > in your Console window), type ? followed by the name of any function and R will open the help tab for that function (e.g., ?mean; Fig. 3). You can also search the help tab directly by typing a function name in its search bar.

Fig. 3. Getting help on R.
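
Besides ?, base R has a few other helpers that are handy when you only half remember a function name. A short sketch:

```r
apropos("^read\\.")        # list objects whose names start with "read."
args(read.csv)             # show the arguments of a function and their defaults
example(mean)              # run the examples from the help page of mean()
# help.search("linear model")   # search the installed help pages by topic (same as ??"linear model")
```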

The internet also contains a vast quantity of useful information. There are blogs, mailing lists, and various websites (e.g., https://stackoverflow.com/) dedicated to providing information about R, its packages, and potential error messages that you may encounter (among other things). The trick is usually determining the key terms to limit your search. I generally start any web-based search with “R-Cran”, which limits and focuses the search. Using “R” as part of your key terms does not, by itself, limit the search.

3 Data Management and Data Manipulation

Now that you’ve learned the basics of R programming, we’ll take things a step further.

We’ll be working with a dataset published in the paper by Karim et al. 2025.

This is a comprehensive surface water quality dataset assembled from a range of regional and global water quality databases, water management organizations, and individual research projects from five countries: USA, Canada, Ireland, England, and China. We will now practice with the Chinese dataset. The goal of this exercise is to test your basic skills in R programming, specifically in manipulating data.

You may not be familiar with all the operations you need to execute in this exercise. Part of the goal with this exercise, however, is for you to become more familiar with the help commands in R and with the internet solutions that exist. Our ultimate goal is to make you aware of the tools that are available to you so that you can become an effective problem solver, working independently on data analyses.


Whenever you start working with a new script, you should first set a working directory. This directory will contain all the data for your analysis and will be where you will save all the data outputs.

Remember that you can check the current working directory by typing:

getwd()
## [1] "/Users/ramirocrego/Documents/GitHub/UCC_BL3009_Practical"

Now, let’s change the working directory to the BL3009 folder you created before class.

setwd("C:/..../BL3009")

3.1 Exploring the Data

Load the China_dataset csv file I provided you. For this we will use the read.csv() function. Make sure the data is in your working directory. Note that I have created a folder Data that contains the csv file.

data <- read.csv("./Data/China_dataset.csv")

View the first 10 lines of the data set.

head(data, 10)
##    Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1    China Hou Bay            Bay 11-01-2001            3.5
## 2    China Hou Bay            Bay 12-02-2001            6.7
## 3    China Hou Bay            Bay 14-03-2001            4.5
## 4    China Hou Bay            Bay 17-04-2001            5.4
## 5    China Hou Bay            Bay 11-05-2001            3.3
## 6    China Hou Bay            Bay 14-06-2001            4.5
## 7    China Hou Bay            Bay 09-07-2001            1.6
## 8    China Hou Bay            Bay 22-08-2001            2.6
## 9    China Hou Bay            Bay 20-09-2001            2.8
## 10   China Hou Bay            Bay 19-10-2001            4.2
##    Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                               0.7                  8.2000
## 2                               2.1                 11.6841
## 3                               0.3                 11.6841
## 4                               5.2                 11.6841
## 5                               1.7                 11.6841
## 6                               7.7                 11.6841
## 7                               1.4                  4.5000
## 8                               3.2                 11.6841
## 9                               4.2                  1.8000
## 10                              2.8                  3.5000
##    Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                   0.40           8.6              17.0            0.21
## 2                   0.72           7.3              18.6            0.13
## 3                   0.51           6.8              19.5            0.32
## 4                   0.58           7.3              23.6            0.11
## 5                   0.49           7.4              27.2            0.28
## 6                   0.32           7.3              27.2            0.19
## 7                   0.19           6.2              29.4            0.36
## 8                   0.27           7.9              30.8            0.26
## 9                   0.30           7.3              29.0            0.33
## 10                  0.46           7.3              25.3            0.35
##    Nitrate..mg.l. CCME_Values CCME_WQI
## 1            0.48    63.68388 Marginal
## 2            0.16    58.30881 Marginal
## 3            0.65    63.36902 Marginal
## 4            0.32    61.18353 Marginal
## 5            0.68    62.39702 Marginal
## 6            0.50    58.85799 Marginal
## 7            0.70    73.17806     Fair
## 8            0.33    68.12720     Fair
## 9            0.38    67.02537     Fair
## 10           0.25    61.49847 Marginal

Assess the overall structure of the data set to get a sense of the number and type of variables included. Make sure that the data type of each column of the data frame is correct and what you expect it to be.

str(data)
## 'data.frame':    45997 obs. of  14 variables:
##  $ Country                         : chr  "China" "China" "China" "China" ...
##  $ Area                            : chr  "Hou Bay" "Hou Bay" "Hou Bay" "Hou Bay" ...
##  $ Waterbody.Type                  : chr  "Bay" "Bay" "Bay" "Bay" ...
##  $ Date                            : chr  "11-01-2001" "12-02-2001" "14-03-2001" "17-04-2001" ...
##  $ Ammonia..mg.l.                  : num  3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
##  $ Biochemical.Oxygen.Demand..mg.l.: num  0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
##  $ Dissolved.Oxygen..mg.l.         : num  8.2 11.7 11.7 11.7 11.7 ...
##  $ Orthophosphate..mg.l.           : num  0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
##  $ pH..ph.units.                   : num  8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
##  $ Temperature..cel.               : num  17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
##  $ Nitrogen..mg.l.                 : num  0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
##  $ Nitrate..mg.l.                  : num  0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
##  $ CCME_Values                     : num  63.7 58.3 63.4 61.2 62.4 ...
##  $ CCME_WQI                        : chr  "Marginal" "Marginal" "Marginal" "Marginal" ...

Some variables appear as character, but we want them to be factors, that is, categorical variables with levels. We can tell R to change the data type from character to factor using the function as.factor():

data$Country <- as.factor(data$Country)
data$Area <- as.factor(data$Area)
data$Waterbody.Type <- as.factor(data$Waterbody.Type)

Check the data structure again:

str(data)
## 'data.frame':    45997 obs. of  14 variables:
##  $ Country                         : Factor w/ 1 level "China": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Area                            : Factor w/ 1 level "Hou Bay": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Waterbody.Type                  : Factor w/ 1 level "Bay": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date                            : chr  "11-01-2001" "12-02-2001" "14-03-2001" "17-04-2001" ...
##  $ Ammonia..mg.l.                  : num  3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
##  $ Biochemical.Oxygen.Demand..mg.l.: num  0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
##  $ Dissolved.Oxygen..mg.l.         : num  8.2 11.7 11.7 11.7 11.7 ...
##  $ Orthophosphate..mg.l.           : num  0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
##  $ pH..ph.units.                   : num  8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
##  $ Temperature..cel.               : num  17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
##  $ Nitrogen..mg.l.                 : num  0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
##  $ Nitrate..mg.l.                  : num  0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
##  $ CCME_Values                     : num  63.7 58.3 63.4 61.2 62.4 ...
##  $ CCME_WQI                        : chr  "Marginal" "Marginal" "Marginal" "Marginal" ...

Do you notice the difference?
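
To see what a factor stores, it helps to look at a small example. The sketch below uses a made-up vector of water quality classes, not the real dataset.

```r
wqi <- c("Marginal", "Fair", "Fair", "Good", "Marginal")   # made-up quality classes

wqi_factor <- as.factor(wqi)
levels(wqi_factor)       # the distinct categories, sorted alphabetically
nlevels(wqi_factor)      # number of categories: 3
table(wqi_factor)        # how many observations fall in each category
as.integer(wqi_factor)   # the integer codes R stores behind the labels

# factor() lets you set the order of the levels yourself
wqi_ordered <- factor(wqi, levels = c("Marginal", "Fair", "Good"))
levels(wqi_ordered)
```

Setting the level order yourself is useful later for plots and tables, where categories appear in level order rather than in the order they occur in the data.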

3.2 Working with dates

When we have dates in our data, we need to tell R that the data should be read as dates and not characters. To do that, we will use the function as.Date(). Note that you need to specify how the date is formatted. There are multiple conventions, such as day-month-year or month-day-year. In our case, the dates are written as day, month, and year (e.g., “11-01-2001”), so the format is "%d-%m-%Y". The month code is a lowercase %m; an uppercase %M stands for minutes and would produce wrong dates.

data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
str(data)
## 'data.frame':    45997 obs. of  14 variables:
##  $ Country                         : Factor w/ 1 level "China": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Area                            : Factor w/ 1 level "Hou Bay": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Waterbody.Type                  : Factor w/ 1 level "Bay": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date                            : Date, format: "2001-01-11" "2001-02-12" ...
##  $ Ammonia..mg.l.                  : num  3.5 6.7 4.5 5.4 3.3 4.5 1.6 2.6 2.8 4.2 ...
##  $ Biochemical.Oxygen.Demand..mg.l.: num  0.7 2.1 0.3 5.2 1.7 7.7 1.4 3.2 4.2 2.8 ...
##  $ Dissolved.Oxygen..mg.l.         : num  8.2 11.7 11.7 11.7 11.7 ...
##  $ Orthophosphate..mg.l.           : num  0.4 0.72 0.51 0.58 0.49 0.32 0.19 0.27 0.3 0.46 ...
##  $ pH..ph.units.                   : num  8.6 7.3 6.8 7.3 7.4 7.3 6.2 7.9 7.3 7.3 ...
##  $ Temperature..cel.               : num  17 18.6 19.5 23.6 27.2 27.2 29.4 30.8 29 25.3 ...
##  $ Nitrogen..mg.l.                 : num  0.21 0.13 0.32 0.11 0.28 0.19 0.36 0.26 0.33 0.35 ...
##  $ Nitrate..mg.l.                  : num  0.48 0.16 0.65 0.32 0.68 0.5 0.7 0.33 0.38 0.25 ...
##  $ CCME_Values                     : num  63.7 58.3 63.4 61.2 62.4 ...
##  $ CCME_WQI                        : chr  "Marginal" "Marginal" "Marginal" "Marginal" ...

Note the new Date format.
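
Once a column is stored as Date, you can do calculations with it. The sketch below uses three made-up date strings in the same day-month-year layout; the object names dates_chr and dates are only for illustration.

```r
dates_chr <- c("11-01-2001", "12-02-2001", "14-03-2001")   # day-month-year text

# %d = day, %m = month (lowercase), %Y = four-digit year.
# Uppercase %M means minutes, so it must not be used for the month.
dates <- as.Date(dates_chr, format = "%d-%m-%Y")
dates                      # "2001-01-11" "2001-02-12" "2001-03-14"
class(dates)               # "Date"

format(dates, "%Y")        # extract the year as text
format(dates, "%m")        # extract the month as text
diff(dates)                # days between consecutive samples
range(dates)               # earliest and latest date
```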


Now, summarize the data to have a look at all variables.

summary(data)
##   Country           Area       Waterbody.Type      Date           
##  China:45997   Hou Bay:45997   Bay:45997      Min.   :2001-02-01  
##                                               1st Qu.:2005-02-11  
##                                               Median :2009-02-21  
##                                               Mean   :2009-04-29  
##                                               3rd Qu.:2014-02-01  
##                                               Max.   :2017-02-28  
##                                               NA's   :500         
##  Ammonia..mg.l.    Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
##  Min.   : 0.0050   Min.   : 0.1000                  Min.   : 0.000         
##  1st Qu.: 0.0240   1st Qu.: 0.5000                  1st Qu.: 6.000         
##  Median : 0.0460   Median : 0.7000                  Median : 7.600         
##  Mean   : 0.1007   Mean   : 0.9106                  Mean   : 8.314         
##  3rd Qu.: 0.1000   3rd Qu.: 1.1000                  3rd Qu.:11.684         
##  Max.   :10.0000   Max.   :21.0000                  Max.   :16.100         
##                                                                            
##  Orthophosphate..mg.l. pH..ph.units.   Temperature..cel. Nitrogen..mg.l.  
##  Min.   :0.00200       Min.   :2.100   Min.   :13.0      Min.   :0.00200  
##  1st Qu.:0.00600       1st Qu.:7.900   1st Qu.:19.8      1st Qu.:0.01100  
##  Median :0.01100       Median :8.000   Median :24.2      Median :0.01900  
##  Mean   :0.01841       Mean   :8.017   Mean   :23.4      Mean   :0.03261  
##  3rd Qu.:0.02100       3rd Qu.:8.200   3rd Qu.:27.0      3rd Qu.:0.03400  
##  Max.   :1.10000       Max.   :9.300   Max.   :33.2      Max.   :1.10000  
##                                                                           
##  Nitrate..mg.l.    CCME_Values       CCME_WQI        
##  Min.   :0.0020   Min.   : 51.08   Length:45997      
##  1st Qu.:0.0260   1st Qu.: 93.18   Class :character  
##  Median :0.0770   Median :100.00   Mode  :character  
##  Mean   :0.1356   Mean   : 96.51                     
##  3rd Qu.:0.1500   3rd Qu.:100.00                     
##  Max.   :5.9000   Max.   :100.00                     
## 

3.3 Data Table Manipulation with Dplyr

One of the most basic R skills is querying and manipulating data tables. Table manipulation is almost always required, regardless of what you use R for. For beginners, practising table manipulation to meet different needs is a great way of improving R skills. If you wish to become really good at R but don’t know where to start, start with tables!

The base R functions that come with the default R installation can handle almost all the table manipulation you will need (e.g., split(), subset(), apply(), sapply(), lapply(), tapply(), aggregate()). However, their syntax is sometimes less user-friendly and intuitive than that of packages built specifically for table manipulation. So, here we introduce a few of the most useful table manipulation functions from the dplyr package. This is a package I use a lot.
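
For comparison, here is a minimal sketch of what some of these base R functions look like in use. The data frame toy and its values are made up for illustration.

```r
toy <- data.frame(
  area = c("North", "North", "South", "South"),   # made-up areas
  temperature = c(17.0, 23.6, 27.2, 29.4),
  pH = c(8.6, 7.3, 7.4, 6.2)
)

subset(toy, temperature > 20)                           # rows with temperature above 20
tapply(toy$temperature, toy$area, mean)                 # mean temperature per area
aggregate(temperature ~ area, data = toy, FUN = mean)   # same result as a data frame
sapply(toy[, c("temperature", "pH")], max)              # maximum of each numeric column
```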

Note that you will have to use install.packages() to download dplyr and library() to load it before using it.

#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Now, we will see how different functions of this package work.

3.3.1 select()

We can use select() to select column(s) by name or by a specific pattern:

head(select(data, pH..ph.units.)) # select column called pH..ph.units.
##   pH..ph.units.
## 1           8.6
## 2           7.3
## 3           6.8
## 4           7.3
## 5           7.4
## 6           7.3
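
select() can also take several columns at once, drop columns, or match names by pattern. A minimal sketch on a small made-up data frame (toy and its column names are for illustration only; it assumes dplyr is installed):

```r
library(dplyr)

toy <- data.frame(
  Area = c("Hou Bay", "Hou Bay", "Hou Bay"),
  pH = c(8.6, 7.3, 6.8),
  Temperature = c(17.0, 18.6, 19.5),
  Nitrate = c(0.48, 0.16, 0.65)
)

select(toy, Area, pH)            # keep two columns by name
select(toy, -Nitrate)            # drop one column
select(toy, Area:Temperature)    # keep a range of adjacent columns
select(toy, starts_with("T"))    # keep columns whose names start with "T"
```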

3.3.2 filter()

Use filter() to keep the row(s) whose column values meet a specific condition:

head(filter(data, Temperature..cel. > 20)) # select rows that have a temperature higher than 20 C
##   Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1   China Hou Bay            Bay 2001-04-17            5.4
## 2   China Hou Bay            Bay 2001-05-11            3.3
## 3   China Hou Bay            Bay 2001-06-14            4.5
## 4   China Hou Bay            Bay 2001-07-09            1.6
## 5   China Hou Bay            Bay 2001-08-22            2.6
## 6   China Hou Bay            Bay 2001-09-20            2.8
##   Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                              5.2                 11.6841
## 2                              1.7                 11.6841
## 3                              7.7                 11.6841
## 4                              1.4                  4.5000
## 5                              3.2                 11.6841
## 6                              4.2                  1.8000
##   Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                  0.58           7.3              23.6            0.11
## 2                  0.49           7.4              27.2            0.28
## 3                  0.32           7.3              27.2            0.19
## 4                  0.19           6.2              29.4            0.36
## 5                  0.27           7.9              30.8            0.26
## 6                  0.30           7.3              29.0            0.33
##   Nitrate..mg.l. CCME_Values CCME_WQI
## 1           0.32    61.18353 Marginal
## 2           0.68    62.39702 Marginal
## 3           0.50    58.85799 Marginal
## 4           0.70    73.17806     Fair
## 5           0.33    68.12720     Fair
## 6           0.38    67.02537     Fair
head(filter(data, Temperature..cel. > 25 & pH..ph.units. > 8)) # select rows that have a temperature higher than 25 C and a pH higher than 8
##   Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1   China Hou Bay            Bay 2001-02-01          0.130
## 2   China Hou Bay            Bay 2001-02-01          0.120
## 3   China Hou Bay            Bay 2001-02-01          0.140
## 4   China Hou Bay            Bay 2001-02-04          0.220
## 5   China Hou Bay            Bay 2001-02-04          0.160
## 6   China Hou Bay            Bay 2001-02-04          0.074
##   Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                              0.9                 11.6841
## 2                              0.9                  5.5000
## 3                              0.6                  5.0000
## 4                              1.1                  4.3000
## 5                              1.0                  4.6000
## 6                              0.7                  4.9000
##   Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                 0.017           8.3              27.5           0.011
## 2                 0.011           8.3              27.4           0.009
## 3                 0.008           8.3              27.3           0.010
## 4                 0.024           8.6              27.4           0.023
## 5                 0.019           8.7              27.3           0.015
## 6                 0.014           8.7              27.2           0.012
##   Nitrate..mg.l. CCME_Values CCME_WQI
## 1          0.020    93.17895     Good
## 2          0.034    93.18025     Good
## 3          0.033    93.18150     Good
## 4          0.150    86.38163     Good
## 5          0.110    86.38019     Good
## 6          0.060    86.38094     Good
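
Conditions can be combined in other ways too. As a minimal sketch on made-up data (toy_wq and its values are hypothetical), the following shows "and", "or", and membership in a set of categories:

```r
library(dplyr)

# Hypothetical toy data
toy_wq <- data.frame(
  CCME_WQI = c("Good", "Fair", "Marginal", "Excellent"),
  Temperature..cel. = c(27.5, 29.4, 23.6, 13.0),
  pH..ph.units. = c(8.3, 6.2, 7.3, 8.3)
)

# AND: conditions separated by a comma behave like &
warm_alkaline <- filter(toy_wq, Temperature..cel. > 25, pH..ph.units. > 8)

# OR: keep rows that satisfy at least one condition
warm_or_acidic <- filter(toy_wq, Temperature..cel. > 25 | pH..ph.units. < 7)

# %in%: keep rows whose category belongs to a set of values
good_rows <- filter(toy_wq, CCME_WQI %in% c("Good", "Excellent"))

nrow(warm_alkaline)
nrow(warm_or_acidic)
nrow(good_rows)
```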

3.3.3 pipe operator

The pipe operator passes the output of one function to the input of the next function. Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right. It can also help you avoid creating and saving a lot of intermediate variables that you don’t need to keep. The older pipe operator, %>%, comes from the magrittr package and is made available when you load dplyr. Since R 4.1.0 there is also a native pipe, |>, which works without loading any package.

# old operator
pipe_result<- data %>%
  select(Temperature..cel.) %>%
  head()
head(pipe_result)
##   Temperature..cel.
## 1              17.0
## 2              18.6
## 3              19.5
## 4              23.6
## 5              27.2
## 6              27.2
# new operator
pipe_result<- data |>
  select(Temperature..cel.) |>
  head()
head(pipe_result)
##   Temperature..cel.
## 1              17.0
## 2              18.6
## 3              19.5
## 4              23.6
## 5              27.2
## 6              27.2
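
To see why piping is easier to read, it can help to write the same steps both ways. This is a small sketch on made-up data (toy_wq is hypothetical); it assumes R 4.1.0 or newer for the native pipe:

```r
library(dplyr)

# Hypothetical toy data
toy_wq <- data.frame(
  Temperature..cel. = c(17.0, 23.6, 27.2, 29.4),
  pH..ph.units. = c(8.6, 7.3, 7.4, 6.2)
)

# Nested: you have to read from the inside out
nested <- head(select(filter(toy_wq, Temperature..cel. > 20), pH..ph.units.), 2)

# Piped: the same steps, read from top to bottom
piped <- toy_wq |>
  filter(Temperature..cel. > 20) |>
  select(pH..ph.units.) |>
  head(2)

# Both approaches give exactly the same result
identical(nested, piped)
```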

3.3.4 arrange()

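This function re-orders rows based on the values of one or more columns. By default, the rows are arranged in ascending order. When several columns are given, the later columns are used to break ties in the earlier ones.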

order_data1<- data %>% 
    arrange(Temperature..cel.) 
head(order_data1)
##   Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1   China Hou Bay            Bay 2004-02-06          5.900
## 2   China Hou Bay            Bay 2004-02-09          0.015
## 3   China Hou Bay            Bay 2008-02-22          0.013
## 4   China Hou Bay            Bay 2004-02-06          4.100
## 5   China Hou Bay            Bay 2008-02-22          0.010
## 6   China Hou Bay            Bay 2008-02-15          0.054
##   Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                              4.3                     2.5
## 2                              4.2                     8.1
## 3                              1.0                     8.8
## 4                              3.5                     3.2
## 5                              0.9                     8.7
## 6                              0.9                     8.8
##   Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                 0.490           7.3              13.0           0.093
## 2                 0.002           8.3              13.0           0.008
## 3                 0.020           8.2              13.2           0.019
## 4                 0.380           7.3              13.3           0.130
## 5                 0.021           8.2              13.3           0.002
## 6                 0.023           7.7              13.4           0.018
##   Nitrate..mg.l. CCME_Values  CCME_WQI
## 1          0.170    61.69006  Marginal
## 2          0.390   100.00000 Excellent
## 3          0.240   100.00000 Excellent
## 4          0.260    66.25210      Fair
## 5          0.240   100.00000 Excellent
## 6          0.019   100.00000 Excellent
order_data2<- data %>%
    arrange(Temperature..cel., pH..ph.units.)
head(order_data2)
##   Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1   China Hou Bay            Bay 2004-02-06          5.900
## 2   China Hou Bay            Bay 2004-02-09          0.015
## 3   China Hou Bay            Bay 2008-02-22          0.013
## 4   China Hou Bay            Bay 2004-02-06          4.100
## 5   China Hou Bay            Bay 2008-02-22          0.010
## 6   China Hou Bay            Bay 2008-02-15          0.054
##   Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                              4.3                     2.5
## 2                              4.2                     8.1
## 3                              1.0                     8.8
## 4                              3.5                     3.2
## 5                              0.9                     8.7
## 6                              0.9                     8.8
##   Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                 0.490           7.3              13.0           0.093
## 2                 0.002           8.3              13.0           0.008
## 3                 0.020           8.2              13.2           0.019
## 4                 0.380           7.3              13.3           0.130
## 5                 0.021           8.2              13.3           0.002
## 6                 0.023           7.7              13.4           0.018
##   Nitrate..mg.l. CCME_Values  CCME_WQI
## 1          0.170    61.69006  Marginal
## 2          0.390   100.00000 Excellent
## 3          0.240   100.00000 Excellent
## 4          0.260    66.25210      Fair
## 5          0.240   100.00000 Excellent
## 6          0.019   100.00000 Excellent
# Now that we have learned the pipe operator, can you explain what order_data1 and order_data2 contain?

Question: Can you arrange the table first by temperature and then by pH in descending order?
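
Descending order is requested by wrapping a column in desc(). A minimal sketch on made-up data (toy_wq and its column names are hypothetical, so you still need to adapt it to the real table):

```r
library(dplyr)

# Hypothetical toy data
toy_wq <- data.frame(
  site = c("a", "b", "c", "d"),
  temp = c(13.0, 13.0, 13.2, 13.3),
  ph = c(7.3, 8.3, 8.2, 7.3)
)

# Ascending by temp; rows with the same temp are ordered by ph, highest first
ordered <- toy_wq |>
  arrange(temp, desc(ph))

ordered$site
```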

3.3.5 mutate()

The mutate() command creates new column(s) and defines their values. For instance, we can create a new column with just the year in which the data were collected. Here we use the function format() and specify that we want the four-digit year with "%Y":

new_col<- data %>%
    mutate(Year = format(Date, "%Y")) 
head(new_col)
##   Country    Area Waterbody.Type       Date Ammonia..mg.l.
## 1   China Hou Bay            Bay 2001-02-11            3.5
## 2   China Hou Bay            Bay 2001-02-12            6.7
## 3   China Hou Bay            Bay 2001-02-14            4.5
## 4   China Hou Bay            Bay 2001-02-17            5.4
## 5   China Hou Bay            Bay 2001-02-11            3.3
## 6   China Hou Bay            Bay 2001-02-14            4.5
##   Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
## 1                              0.7                  8.2000
## 2                              2.1                 11.6841
## 3                              0.3                 11.6841
## 4                              5.2                 11.6841
## 5                              1.7                 11.6841
## 6                              7.7                 11.6841
##   Orthophosphate..mg.l. pH..ph.units. Temperature..cel. Nitrogen..mg.l.
## 1                  0.40           8.6              17.0            0.21
## 2                  0.72           7.3              18.6            0.13
## 3                  0.51           6.8              19.5            0.32
## 4                  0.58           7.3              23.6            0.11
## 5                  0.49           7.4              27.2            0.28
## 6                  0.32           7.3              27.2            0.19
##   Nitrate..mg.l. CCME_Values CCME_WQI Year
## 1           0.48    63.68388 Marginal 2001
## 2           0.16    58.30881 Marginal 2001
## 3           0.65    63.36902 Marginal 2001
## 4           0.32    61.18353 Marginal 2001
## 5           0.68    62.39702 Marginal 2001
## 6           0.50    58.85799 Marginal 2001
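
mutate() can create several columns in one call, and new columns can be calculated from existing ones. The sketch below uses made-up data (toy_wq, Month and Temp_F are hypothetical names used only for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy_wq <- data.frame(
  Date = as.Date(c("2001-02-11", "2008-07-22", "2015-11-03")),
  Temperature..cel. = c(17.0, 27.2, 13.3)
)

toy_new <- toy_wq |>
  mutate(
    Year = as.integer(format(Date, "%Y")),    # year as a number instead of text
    Month = format(Date, "%m"),               # two-digit month as text
    Temp_F = Temperature..cel. * 9 / 5 + 32   # calculated from an existing column
  )

toy_new
```

Note that format() returns text, which is why the Year column in the table above has type character; as.integer() converts it to a number when that is more convenient.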

Question: Can you create a new column called zero and give it a value of 0?

3.3.6 summarise()

This function calculates summary statistics across all rows, or across the rows within each group when it is used in combination with group_by():

sum_table <- data %>%
  summarise(mean(pH..ph.units.))
sum_table
##   mean(pH..ph.units.)
## 1            8.016912
sum_table2 <- data %>%
  summarise(avg_PH = mean(pH..ph.units.), min_PH = min(pH..ph.units.), max_PH = max(pH..ph.units.))
sum_table2
##     avg_PH min_PH max_PH
## 1 8.016912    2.1    9.3
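
Real datasets often contain missing values, and functions such as mean() return NA unless told to ignore them. As a small sketch on made-up data (toy_wq is hypothetical), this is how n(), na.rm = TRUE and sd() can be used inside summarise():

```r
library(dplyr)

# Hypothetical toy data with one missing pH value
toy_wq <- data.frame(pH..ph.units. = c(8.0, 7.0, NA, 9.0))

toy_summary <- toy_wq |>
  summarise(
    n_rows = n(),                                # number of rows, including the NA
    n_missing = sum(is.na(pH..ph.units.)),       # number of missing values
    avg_PH = mean(pH..ph.units., na.rm = TRUE),  # without na.rm = TRUE this would be NA
    sd_PH = sd(pH..ph.units., na.rm = TRUE)      # standard deviation
  )

toy_summary
```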

3.3.7 group_by()

This is a great function. group_by() divides the rows into groups based on the grouping column(s) provided. It is usually used in combination with other functions that define what to do with the rows once they are in groups. When group_by() and summarise() are used together, you are essentially telling R to separate the rows into different groups and, for each group, to use summarise() to generate a series of summary statistics that characterize the column values.

Let’s calculate the mean, min and max pH per year:

group_summary <- new_col |>
  group_by(Year) |>
  summarise(avg_PH= mean(pH..ph.units.), min_PH= min(pH..ph.units.), max_PH= max(pH..ph.units.))
group_summary
## # A tibble: 18 × 4
##    Year  avg_PH min_PH max_PH
##    <chr>  <dbl>  <dbl>  <dbl>
##  1 2001    8.20    4.1    9.3
##  2 2002    8.07    6.7    8.8
##  3 2003    8.17    7.1    8.7
##  4 2004    8.12    7.2    8.9
##  5 2005    8.14    7      9  
##  6 2006    8.01    6.8    8.7
##  7 2007    8.08    2.1    9.3
##  8 2008    8.15    5.9    8.9
##  9 2009    8.05    6.9    8.8
## 10 2010    7.96    7.1    8.9
## 11 2011    7.93    7      8.6
## 12 2012    7.81    6.1    8.4
## 13 2013    8.02    7      8.7
## 14 2014    7.95    7.1    8.7
## 15 2015    7.92    6.8    8.7
## 16 2016    7.86    6.5    8.6
## 17 2017    7.91    6.8    8.7
## 18 <NA>    8.13    7.2    8.8

Very cool right!!??
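
The last row of the table above has a Year of NA: rows without a valid date form a group of their own. A minimal sketch on made-up data (toy_wq is hypothetical) of how such rows could be dropped before grouping, and how n() counts the observations behind each average:

```r
library(dplyr)

# Hypothetical toy data with one missing year
toy_wq <- data.frame(
  Year = c("2001", "2001", "2002", "2002", NA),
  pH..ph.units. = c(8.0, 9.0, 7.0, 7.5, 8.1)
)

toy_by_year <- toy_wq |>
  filter(!is.na(Year)) |>        # drop rows without a year
  group_by(Year) |>
  summarise(
    n_obs = n(),                 # number of rows in each group
    avg_PH = mean(pH..ph.units.),
    .groups = "drop"             # return an ungrouped table
  )

toy_by_year
```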

3.4 Conclusion on data management

This has been a glimpse of what can be done in R to work with tabular data. There are plenty of other packages that you will learn in time by searching online and learning from other people, but for now, the functions we covered are a very good set of tools to do most of what you will need.

4 An introduction to plotting using ggplot2

The package ggplot2 is widely used for data visualization in R. This package is extremely powerful. I used to like using base R code for plotting, but eventually I had to admit that ggplot2 is extremely cool and had to adopt it.

ggplot2 is based on the grammar of graphics, which allows users to build complex plots from data in a systematic way, one layer at a time.

As with any R package, before you can use ggplot2, you need to install it (if you haven’t already) and load it into your R session.

#install.packages("ggplot2")
library(ggplot2)

4.1 Basic Concepts

There are a few concepts that we need to know to understand how to code ggplots.

  • Data: The dataset you want to visualize.
  • Aesthetics (aes): The visual properties of the plot (e.g., x and y position, color, size).
  • Geometries (geom_): The actual marks we put on the plot (e.g., points, lines, bars). We can, for instance, use geom_line() for plotting lines.
  • Facets: Subplots that display subsets of the data.
  • Scales: Control how data values are mapped to visual properties.
  • Themes: Control the overall appearance of the plot (e.g., font size, background color).
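
To see how these concepts fit together, here is a minimal sketch that uses one of each in a single plot. The data frame toy_wq and its values are made up for illustration:

```r
library(ggplot2)

# Hypothetical toy data: two sites, four measurements each
toy_wq <- data.frame(
  temp = c(15, 18, 22, 26, 16, 20, 24, 28),
  oxygen = c(9.5, 8.8, 7.9, 7.1, 9.2, 8.3, 7.5, 6.8),
  site = rep(c("Site A", "Site B"), each = 4)
)

p <- ggplot(data = toy_wq,                               # data
            aes(x = temp, y = oxygen, color = site)) +   # aesthetics
  geom_point() +                                         # geometry
  facet_wrap(~ site) +                                   # facets: one panel per site
  scale_y_continuous(limits = c(0, 12)) +                # scale: range of the y-axis
  theme_bw()                                             # theme

p
```

Each piece is added with +, so you can build a plot step by step and check the result after every addition.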

4.2 Creating Your First Plot

Let’s start with a simple scatter plot using the dataset we have been working with.

A basic scatter plot

First, let’s filter the data to keep only the year 2015. Let’s review some data manipulation steps:

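# Convert the Date column to Date class
# (%d = day, %m = month, %Y = four-digit year; uppercase %M would mean minutes)
data$Date <- as.Date(data$Date, format = "%d-%m-%Y")
# Create a Year column and call the new object data2
data2 <- data %>%
  mutate(Year = format(Date, "%Y"))
# Keep only the rows from 2015 (Year is stored as text)
data3 <- data2 |> filter(Year == "2015")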
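
The format string passed to as.Date() must match how the dates are written in the raw file. As a small sketch, assuming made-up dates written as day-month-year (the raw_dates values are hypothetical):

```r
# Hypothetical raw dates stored as text, written as day-month-year
raw_dates <- c("17-02-2001", "22-07-2008", "31-12-2015")

# %d = day, %m = month (lowercase), %Y = four-digit year
# Note: uppercase %M means minutes, not months
parsed <- as.Date(raw_dates, format = "%d-%m-%Y")

class(parsed)         # the column is now a Date, not text
format(parsed, "%Y")  # extract the year as text
format(parsed, "%m")  # extract the month as text
```

It is worth checking a few parsed dates against the raw file, because a wrong format string can produce NA values or wrong months.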

Now, we can create our first plot. Let’s plot dissolved oxygen against temperature:

ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
  geom_point()

Here, ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) initializes the plot with the data3 dataset, setting Temperature..cel. on the x-axis and Dissolved.Oxygen..mg.l. on the y-axis. geom_point() adds points to the plot to create a scatter plot.

We can see that as temperature increases, the amount of oxygen dissolved in the water decreases.

We can also see a line of dots, which hints that something may be wrong with some of the measurements. But for the purpose of this practical, we can ignore that.

4.3 Customizing the plot

Titles and labels

We now have a basic plot. Let’s start to customize it. We will first add a title and axis labels with the labs() function (a subtitle can be added in the same way with the subtitle argument).

ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
  geom_point() +
  labs(title = "Scatter Plot of temp vs diss. oxygen",
       x = "Temperature (C)",
       y = "Dissolved oxygen (mg/l)")

4.3.1 Color and size

We can also change the point color using col = "red". Try other colors, like “blue”, or “green”.

ggplot(data = data3, aes(x = Temperature..cel., y = Dissolved.Oxygen..mg.l.)) +
  geom_point(col = "red") +
  labs(title = "Scatter Plot of temp vs diss. oxygen",
       x = "Temperature (C)",
       y = "Dissolved oxygen (mg/l)")

Looking pretty good.
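
Point size and transparency can be set in the same way as color. There is also an important difference between setting a fixed value outside aes() and mapping a variable inside aes(). A minimal sketch on made-up data (toy_wq and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data
toy_wq <- data.frame(
  temp = c(15, 18, 22, 26, 28),
  oxygen = c(9.5, 8.8, 7.9, 7.1, 6.8),
  quality = c("Good", "Good", "Fair", "Fair", "Marginal")
)

# Fixed appearance: set outside aes(), applies to every point
p_fixed <- ggplot(toy_wq, aes(x = temp, y = oxygen)) +
  geom_point(color = "blue", size = 3, alpha = 0.6)

# Mapped appearance: set inside aes(), varies with the data and adds a legend
p_mapped <- ggplot(toy_wq, aes(x = temp, y = oxygen, color = quality)) +
  geom_point(size = 3)

p_fixed
p_mapped
```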

4.4 Using Different Geometries

Let’s now try different geometries.

4.4.1 Histogram

For a histogram we use geom_histogram().

ggplot(data = data3, aes(x = Temperature..cel.)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black") +
  labs(title = "Histogram of water temperature",
       x = "Temperature (C)",
       y = "Count")

4.4.2 Box Plot

We use geom_boxplot() for a boxplot.

This can be useful to visualize the distribution of a variable across different categories. For instance, we can look at the concentration of nitrogen across the different water quality index categories:

ggplot(data = data3, aes(x = factor(CCME_WQI), y = Nitrogen..mg.l.)) +
  geom_boxplot() +
  labs(x = "Water Quality Index",
       y = "Nitrogen (mg/l)")
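
By default, the categories on the x-axis are sorted alphabetically. If you prefer them ordered from worst to best, you can set the factor levels yourself. This is a sketch on made-up data (toy_wq is hypothetical), assuming the usual CCME category names:

```r
library(ggplot2)

# Hypothetical toy data
toy_wq <- data.frame(
  CCME_WQI = c("Good", "Fair", "Marginal", "Excellent", "Good", "Fair"),
  Nitrogen..mg.l. = c(0.02, 0.13, 0.28, 0.01, 0.03, 0.19)
)

# Order the categories from worst to best instead of alphabetically
toy_wq$CCME_WQI <- factor(
  toy_wq$CCME_WQI,
  levels = c("Poor", "Marginal", "Fair", "Good", "Excellent")
)

p_box <- ggplot(toy_wq, aes(x = CCME_WQI, y = Nitrogen..mg.l.)) +
  geom_boxplot() +
  labs(x = "Water Quality Index",
       y = "Nitrogen (mg/l)")

p_box
```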

4.5 Customizing the plot appearance

4.5.1 Themes

Themes control the overall look of your plot. There are many themes available. For instance, we can use theme_minimal().

ggplot(data = data3, aes(x = factor(CCME_WQI), y = Nitrogen..mg.l.)) +
  geom_boxplot() +
  labs(x = "Water Quality Index",
       y = "Nitrogen (mg/l)") +
  theme_minimal()

Other available themes include theme_gray(), theme_classic(), theme_bw(), and more. Check them out yourself.

With ggplot2, you have a powerful tool to explore and present your data in compelling ways.

There are many options that you can control on ggplot. The best way to learn all the possibilities is by playing with it. Pretty much, everything can be customized. Feel free to experiment with different datasets and ggplot2 functions to create the visualizations that best communicate your insights!

4.6 Saving Your Plot

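To save your plot to a file, use the function ggsave(). It is good practice to first store your plot as an object and pass it to the plot argument; if you leave plot out, ggsave() saves the last plot that was displayed.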

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Scatter Plot of MPG vs Weight")

ggsave("scatter_plot.png", plot = p, width = 6, height = 4)
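
ggsave() chooses the file type from the file extension, and width and height are given in inches by default. A small sketch that saves a PDF into a temporary folder (the out_file location is only an example, so nothing in your project is overwritten):

```r
library(ggplot2)

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Example output location: R's temporary folder for this session
out_file <- file.path(tempdir(), "scatter_plot.pdf")

# The .pdf extension tells ggsave() to write a PDF; width and height are in inches
ggsave(out_file, plot = p, width = 6, height = 4)

# Check that the file was written
file.exists(out_file)
```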

4.7 A time series plot

Finally, we want to explore how different parameters change over time.

For this, we need to combine everything we learned to first create the variables we need and then plot them. A full analysis.

First, let’s look at the data2 dataframe we created by using the summary() function:

summary(data2)
##    Country              Area           Waterbody.Type          Date           
##  Length:45997       Length:45997       Length:45997       Min.   :2001-02-01  
##  Class :character   Class :character   Class :character   1st Qu.:2005-02-11  
##  Mode  :character   Mode  :character   Mode  :character   Median :2009-02-21  
##                                                           Mean   :2009-04-29  
##                                                           3rd Qu.:2014-02-01  
##                                                           Max.   :2017-02-28  
##                                                           NA's   :500         
##  Ammonia..mg.l.    Biochemical.Oxygen.Demand..mg.l. Dissolved.Oxygen..mg.l.
##  Min.   : 0.0050   Min.   : 0.1000                  Min.   : 0.000         
##  1st Qu.: 0.0240   1st Qu.: 0.5000                  1st Qu.: 6.000         
##  Median : 0.0460   Median : 0.7000                  Median : 7.600         
##  Mean   : 0.1007   Mean   : 0.9106                  Mean   : 8.314         
##  3rd Qu.: 0.1000   3rd Qu.: 1.1000                  3rd Qu.:11.684         
##  Max.   :10.0000   Max.   :21.0000                  Max.   :16.100         
##                                                                            
##  Orthophosphate..mg.l. pH..ph.units.   Temperature..cel. Nitrogen..mg.l.  
##  Min.   :0.00200       Min.   :2.100   Min.   :13.0      Min.   :0.00200  
##  1st Qu.:0.00600       1st Qu.:7.900   1st Qu.:19.8      1st Qu.:0.01100  
##  Median :0.01100       Median :8.000   Median :24.2      Median :0.01900  
##  Mean   :0.01841       Mean   :8.017   Mean   :23.4      Mean   :0.03261  
##  3rd Qu.:0.02100       3rd Qu.:8.200   3rd Qu.:27.0      3rd Qu.:0.03400  
##  Max.   :1.10000       Max.   :9.300   Max.   :33.2      Max.   :1.10000  
##                                                                           
##  Nitrate..mg.l.    CCME_Values       CCME_WQI             Year          
##  Min.   :0.0020   Min.   : 51.08   Length:45997       Length:45997      
##  1st Qu.:0.0260   1st Qu.: 93.18   Class :character   Class :character  
##  Median :0.0770   Median :100.00   Mode  :character   Mode  :character  
##  Mean   :0.1356   Mean   : 96.51                                        
##  3rd Qu.:0.1500   3rd Qu.:100.00                                        
##  Max.   :5.9000   Max.   :100.00                                        
## 

Notice in the summary that the Date column has 500 missing values (NA's). These will create problems, so we need to get rid of those rows. For that, we can use the function filter() and keep only the rows that have a date using complete.cases(Date):

nrow(data2)
## [1] 45997
data2 <- data2 |> filter(complete.cases(Date))
nrow(data2)
## [1] 45497

Note, we have removed the 500 rows with NAs.
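
A common alternative to complete.cases() for a single column is !is.na(). As a minimal sketch on made-up data (toy_wq is hypothetical), it is good practice to count the missing values first and confirm that the same number of rows is removed:

```r
library(dplyr)

# Hypothetical toy data with two missing dates
toy_wq <- data.frame(
  Date = as.Date(c("2001-02-11", NA, "2015-11-03", NA)),
  pH..ph.units. = c(8.6, 7.3, 6.8, 7.4)
)

# Count the missing values in one column
n_missing <- sum(is.na(toy_wq$Date))

# Keep only the rows with a valid date
toy_clean <- toy_wq |> filter(!is.na(Date))

# The number of rows dropped should match the number of missing dates
nrow(toy_wq) - nrow(toy_clean) == n_missing
```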

Now, we can create a new dataframe with the average value per year across all years:

China <- data2 |>
  mutate(Year = format(Date, "%Y")) |>   # year of each sample, as text
  group_by(Year) |>
  summarise(
    Ammonia = mean(Ammonia..mg.l., na.rm = TRUE),
    Dissolved.Oxygen = mean(Dissolved.Oxygen..mg.l., na.rm = TRUE),
    pH = mean(pH..ph.units., na.rm = TRUE),
    Temperature = mean(Temperature..cel., na.rm = TRUE),
    Nitrogen = mean(Nitrogen..mg.l., na.rm = TRUE),
    Nitrate = mean(Nitrate..mg.l., na.rm = TRUE),
    .groups = "drop"
  )

# Convert Year back to a Date (1 January of each year) so it plots as a time axis
China$Year <- as.Date(paste0(China$Year, "-01-01"))

Finally, we can create time series to see how these parameters have changed across years:

ggplot(China, aes(x = Year, y = Ammonia)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

ggplot(China, aes(x = Year, y = Dissolved.Oxygen)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

ggplot(China, aes(x = Year, y = pH)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

ggplot(China, aes(x = Year, y = Temperature)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

ggplot(China, aes(x = Year, y = Nitrogen)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

ggplot(China, aes(x = Year, y = Nitrate)) +
  geom_point() +
  geom_line(aes(group = "1")) +
  geom_smooth(method = "gam", se = TRUE, color = "blue")
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
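
The six plots above repeat the same code with a different y variable. One way to avoid the repetition is to loop over the variable names. This is only a sketch on made-up yearly averages (toy_yearly and its values are hypothetical, and the smoother is left out to keep it short):

```r
library(ggplot2)

# Hypothetical yearly averages standing in for the real summary table
toy_yearly <- data.frame(
  Year = as.Date(paste0(2001:2006, "-01-01")),
  Ammonia = c(0.20, 0.18, 0.15, 0.16, 0.12, 0.10),
  pH = c(8.2, 8.1, 8.2, 8.1, 8.1, 8.0)
)

variables <- c("Ammonia", "pH")

# Build one plot per variable; .data[[v]] looks up a column by its name
plots <- lapply(variables, function(v) {
  ggplot(toy_yearly, aes(x = Year, y = .data[[v]])) +
    geom_point() +
    geom_line() +
    labs(title = paste("Yearly mean", v), x = "Year", y = v)
})
names(plots) <- variables

plots$pH  # display one of the plots
```

The result is a named list, so each plot can be displayed or saved individually.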

5 Assignment

This assignment is intended to let you practise the skills you have learned so far.

I want you to analyse the trends in different freshwater parameters, but this time for a water system in Ireland.

What you need to do is:

  1. Start a new R script.

  2. Load the dataset: “Ireland_dataset.csv”

Ireland <- read.csv("./Data/Ireland_dataset.csv")

  3. Filter the water Area. Check this table and use the area name based on your group number:

Group number   Area name
1              Corrib, Aille
2              Corrib, Ballyquirke
3              Corrib, Carra
4              Corrib, Bofin GY
5              Corrib, Corrib Lower
6              Lough Neagh _ Lower Bann, Emy
7              Foyle, Derg DL
8              Foyle, Finn DL
9              Foyle, Mourne DL
10             Newry, Fane, Glyde and Dee, Brackan
11             Newry, Fane, Glyde and Dee, Monalty
12             Newry, Fane, Glyde and Dee, Muckno
13             Newry, Fane, Glyde and Dee, Naglack
14             Newry, Fane, Glyde and Dee, Spring
15             Mal Bay, Doo CE
16             Mal Bay, Keagh
17             Mal Bay, Lickeen
18             Mal Bay, Naminna
19             Lough Swilly, Akibbon
20             Lough Swilly, Fern
21             Lough Swilly, Gartan
22             Tralee Bay-Feale, Cam KY
23             Tralee Bay-Feale, Gill KY
24             Lee, Cork Harbour and Youghal Bay, Allua
25             Lower Shannon (C), Graney

Hint 1: You need to load the dplyr package.
Hint 2: Make sure the variable Area has the correct structure (factor).

  4. Similar to the China example, calculate the average value of ammonia, dissolved oxygen, pH, temperature, and nitrogen per year.

Hint: You need to load the ggplot2 package.

  5. For each variable, create a time series plot showing the trend.

Hint: Adapt the code we used for the time series plots of the data from China.
For more points: Customize your plots using different colors. Make sure your variables are properly named.

  6. In a Word document, paste your entire R code and all your plots.

Good code is easy to read, well organised and well annotated. You will get marks accordingly.

  7. Write a short paragraph interpreting the parameter trends in the figures, including a potential explanation for the trends you observe.

  8. Submit on Canvas (do not forget to write the names of all participants).